Cluster validation

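Clustering methods need to be evaluated (see Clustering analysis; for Community structure, see Evaluation of community detection methods). Because clustering is unsupervised and there are usually no ground-truth labels to compare against, evaluating clustering methods is often harder than evaluating supervised methods.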

The validation approaches fall into three major categories: external evaluation, internal evaluation, and relative evaluation¹. The general criteria are compactness, connectedness, and separation. Compactness means that the members of a cluster should be close to each other. This criterion works well for identifying spherical clusters but may fail to detect clusters that are connected without being compact, such as elongated ones. Connectedness means that each member of a cluster should be very close to some other members of the cluster, so that the cluster forms a connected set in the space. Separation means that two different clusters should be well separated from each other.
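The three criteria can be made concrete with simple distance-based measures. The following is a minimal sketch, not a standard definition: the function names (`compactness`, `largest_nn_gap`, `separation`) and the choice of Euclidean distance are assumptions made for illustration, and the nearest-neighbour gap is only a crude proxy for connectedness.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def compactness(cluster):
    """Mean pairwise distance within one cluster (lower = more compact)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def largest_nn_gap(cluster):
    """Largest nearest-neighbour distance in the cluster.

    Crude connectedness proxy (an assumption of this sketch): a small value
    means every member has at least one other member close by.
    """
    if len(cluster) < 2:
        return 0.0
    return max(
        min(dist(p, q) for j, q in enumerate(cluster) if j != i)
        for i, p in enumerate(cluster)
    )


def separation(cluster_a, cluster_b):
    """Smallest distance between a member of one cluster and a member of the other."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


if __name__ == "__main__":
    chain = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]  # elongated cluster
    blob = [(0, 0), (1, 0), (0, 1), (1, 1)]           # compact cluster
    for name, cluster in (("chain", chain), ("blob", blob)):
        print(name, compactness(cluster), largest_nn_gap(cluster))
```

In this toy data the elongated chain scores worse on compactness than the square blob, although every chain member still has a neighbour at distance 1, which illustrates why a compactness-only criterion can miss connected clusters.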

External evaluation uses pre-classified items or gold standards to validate the clustering results. The results depend on the benchmark used and thus can be biased². A good score on a benchmark therefore does not guarantee real-world performance.
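One common way to compare a clustering with a gold standard is the Rand index, the fraction of item pairs on which the two partitions agree (both put the pair together, or both keep it apart). The sketch below is one possible external measure, chosen here for illustration; the function name `rand_index` and the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(predicted, gold):
    """Fraction of item pairs on which two labelings agree.

    A pair counts as an agreement when both labelings place the two items in
    the same cluster, or both place them in different clusters. The label
    values themselves do not matter, only the grouping they induce.
    """
    if len(predicted) != len(gold):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(gold)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (predicted[i] == predicted[j]) == (gold[i] == gold[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    gold = [0, 0, 0, 1, 1, 1]       # pre-classified items
    predicted = [1, 1, 0, 0, 0, 0]  # output of some clustering method
    print(rand_index(predicted, gold))
```

Because the score is computed against one particular gold standard, it inherits whatever bias that benchmark has.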

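Internal evaluation uses only the clustered data itself, normally by comparing intra-group similarity with inter-group similarity: members of the same cluster should be more similar to each other than to members of other clusters.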
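As an illustration of the intra-group versus inter-group comparison, the sketch below divides the mean distance between items in the same cluster by the mean distance between items in different clusters. This particular ratio, the name `intra_inter_ratio`, and the use of Euclidean distance as an inverse of similarity are assumptions of the example, not an established index.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def intra_inter_ratio(points, labels):
    """Mean within-cluster distance divided by mean between-cluster distance.

    Values well below 1 suggest that items are closer to members of their own
    cluster than to members of other clusters.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    if not intra or not inter:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(intra_inter_ratio(points, [0, 0, 1, 1]))  # groups nearby points
    print(intra_inter_ratio(points, [0, 1, 0, 1]))  # groups distant points
```

No gold standard is needed here; the score depends only on the data and the assigned labels.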

Relative evaluation compares the clusterings produced by different methods, or by the same method with different parameters.
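A relative evaluation can be sketched by running one method with several parameter values and scoring each result with the same index. In the example below, the clustering method (`threshold_clusters`, which splits sorted 1-D values at large gaps) is a hypothetical toy method, and the mean silhouette width is used as the scoring index by assumption; any other method or index could be substituted.

```python
def threshold_clusters(values, threshold):
    """Toy 1-D method: sort the values and start a new cluster at every gap > threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > threshold:
            current += 1
        labels[nxt] = current
    return labels


def mean_silhouette(values, labels):
    """Mean silhouette width for 1-D data (higher is better).

    For each item, a is the mean distance to its own cluster and b is the
    mean distance to the nearest other cluster; the item scores
    (b - a) / max(a, b). Singleton clusters score 0 by convention.
    """
    n = len(values)
    total = 0.0
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(abs(values[i] - values[j]))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            continue  # singleton cluster, or only one cluster overall
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


if __name__ == "__main__":
    values = [1.0, 1.2, 1.1, 5.0, 5.1, 5.3, 9.0, 9.2]
    scores = {t: mean_silhouette(values, threshold_clusters(values, t))
              for t in (0.15, 1.0, 3.75)}
    best = max(scores, key=scores.get)
    print(scores, "best threshold:", best)
```

The comparison ranks the candidate settings against each other only; it does not say whether even the best one is good in absolute terms.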

With information theory

Information theory for cluster analysis

To read

References

Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. (2001). “On Clustering Validation Techniques”. Journal of Intelligent Information Systems 17: 107–145. doi:10.1023/A:1012801612483. http://www.springerlink.com/content/k43h06u025w2x4q6/.
Handl, J.; Knowles, J.; Kell, D. B. (Aug 2005). “Computational cluster validation in post-genomic data analysis”. Bioinformatics 21 (15): 3201–12. doi:10.1093/bioinformatics/bti517. PMID 15914541.